tion Grammars (PLTIG), a lexicalized counterpart to Probabilistic Context-Free Grammars (PCFG), to problems in stochastic natural-language processing. Comparing the performance of PLTIGs with non-hierarchical N-gram models and PCFGs, we show that PLTIG combines the best aspects of both, with language modeling capability comparable to N-grams, and improved parsing performance over its non-lexicalized counterpart. Furthermore, training of PLTIGs displays faster convergence than PCFGs.

1 Introduction

There are many advantages to expressing a grammar in a lexicalized form, where an observable word of the language is encoded in each grammar rule. First, the lexical words help to clarify ambiguities that cannot be resolved by the sentence structures alone. For example, to correctly attach a prepositional phrase, it is often necessary to consider the lexical relationships between the head word of the prepositional phrase and those of the phrases it might modify. Second, lexicalizing the grammar rules increases computational efficiency because those rules that do not contain any observed words can be pruned away immediately. The Lexicalized Tree Insertion Grammar formalism (LTIG) has been proposed as a way to lexicalize context-free grammars (Schabes and Waters, 1994). We now apply a probabilistic variant of this formalism, Probabilistic Tree Insertion Grammars (PLTIGs), to natural language processing problems of stochastic parsing and language modeling. This paper presents two sets of experiments, comparing PLTIGs with non-lexicalized Probabilistic Context-Free Grammars (PCFGs) (Pereira and Schabes, 1992) and non-hierarchical N-gram
models that use the right branching bracketing
heuristics (period attaches high) as their pars- 
ing strategy. We show that PLTIGs can be in- 
duced from partially bracketed data, and that 
the resulting trained grammars can parse un- 
seen sentences and estimate the likelihood of 
their occurrences in the language. The experi- 
ments are run on two corpora: the Air Travel 
Information System (ATIS) corpus and a sub- 
set of the Wall Street Journal TreeBank cor- 
pus. The results show that the lexicalized na- 
ture of the formalism helps our induced PLTIGs 
to converge faster and provide a better language 
model than PCFGs while maintaining comparable parsing qualities. Although N-gram models
still slightly out-perform PLTIGs on language modeling, they lack the high-level structures
needed for parsing. Therefore, PLTIGs have combined the best of two worlds: the language
modeling capability of N-grams and the parse quality of context-free grammars.

The rest of the paper is organized as fol- 
lows: first, we present an overview of the PLTIG 
formalism; then we describe the experimental 
setup; next, we interpret and discuss the results 
of the experiments; finally, we outline future di- 
rections of the research. 

2 PLTIG and Related Work 

The inspiration for the PLTIG formalism stems 
from the desire to lexicalize a context-free gram- 
mar. There are three ways in which one might 
do so. First, one can modify the tree structures so that all context-free productions contain lexical items. Greibach normal form provides a well-known example of such a lexicalized context-free formalism. This method is not practical because altering the structures of the grammar damages the linguistic information stored in the original grammar (Schabes and Waters, 1994). Second, one might propagate lexical information upward through the productions. Examples of formalisms using this approach include the work of Magerman (1995), Charniak (1997), Collins (1997), and Goodman (1997). A third, more linguistically motivated approach is to expand the domain of productions downward to incorporate more tree structures. The Lexicalized Tree-Adjoining Grammar (LTAG) formalism (Schabes et al., 1988; Schabes, 1990), although not context-free, is the best-known instance in this category. PLTIGs belong to this third category and generate only context-free languages.

LTAGs (and LTIGs) are tree-rewriting sys- 
tems, consisting of a set of elementary trees 
combined by tree operations. We distinguish 
two types of trees in the set of elementary trees: 
the initial trees and the auxiliary trees. Unlike 
full parse trees but reminiscent of the produc- 
tions of a context-free grammar, both types of 
trees may have nonterminal leaf nodes. Aux- 
iliary trees have, in addition, a distinguished 
nonterminal leaf node, labeled with the same 
nonterminal as the root node of the tree, called 
the foot node. Two types of operations are used 
to construct derived trees, or parse trees: substitution and adjunction. An initial tree can be substituted into the nonterminal leaf node of another tree in a way similar to the substitution of nonterminals in the production rules of CFGs. An auxiliary tree is inserted into another tree through the adjunction operation, which splices the auxiliary tree into the target tree at a node labeled with the same nonterminal as the root and foot of the auxiliary tree. By using a tree representation, LTAGs extend the domain of locality of a grammatical primitive, so that they capture both lexical features and hierarchical structure. Moreover, the adjunction operation elegantly models intuitive linguistic concepts such as long-distance dependencies between words. Unlike the N-gram model, which only offers dependencies between neighboring words, these trees can model the interaction of structurally related words that occur far apart.

Like LTAGs, LTIGs are tree-rewriting systems, but they differ from LTAGs in their generative power. LTAGs can generate some strictly context-sensitive languages. They do so by using wrapping auxiliary trees, which allow non-empty frontier nodes (i.e., leaf nodes whose labels are not the empty terminal symbol) on both sides of the foot node. A wrapping auxiliary tree makes the formalism context-sensitive because it coordinates the string to the left of its foot with the string to the right of its foot while allowing a third string to be inserted into the foot. Just as the ability to recursively center-embed moves the required parsing time from $O(n)$ for regular grammars to $O(n^3)$ for context-free grammars, so the ability to wrap auxiliary trees moves the required parsing time further, to $O(n^6)$ for tree-adjoining grammars (see footnote 1). This level of complexity is far too computationally expensive for current technologies. The complexity of LTAGs can be moderated by eliminating just the wrapping auxiliary trees. LTIGs prevent wrapping by restricting auxiliary tree structures to be in one of two forms: the left auxiliary tree, whose non-empty frontier nodes are all to the left of the foot node; or the right auxiliary tree, whose non-empty frontier nodes are all to the right of the foot node. Auxiliary trees of different types cannot adjoin into each other if the adjunction would result in a wrapping auxiliary tree. The resulting system is strongly equivalent to CFGs, yet is fully lexicalized and still $O(n^3)$ parsable, as shown by Schabes and Waters (1994).
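The left/right restriction on auxiliary trees can be checked mechanically from the frontier of a tree. The following is a minimal sketch, not part of the formalism's definition: the nested-tuple tree encoding, the `*` suffix that marks the foot node, the empty string standing for the empty terminal, and the function names are all assumptions made for this illustration.

```python
from typing import List, Tuple, Union

# Hypothetical encoding: an internal node is (label, [children]); a leaf is a string.
# A leaf ending in "*" is the foot node; the empty string "" is the empty terminal.
Tree = Union[str, Tuple[str, list]]


def frontier(tree: Tree) -> List[str]:
    """Return the leaves of the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    leaves: List[str] = []
    for child in children:
        leaves.extend(frontier(child))
    return leaves


def classify_auxiliary(tree: Tree) -> str:
    """Classify an auxiliary tree as 'left', 'right', 'wrapping', or 'empty'."""
    leaves = frontier(tree)
    feet = [i for i, leaf in enumerate(leaves) if leaf.endswith("*")]
    if len(feet) != 1:
        raise ValueError("an auxiliary tree needs exactly one foot node")
    foot = feet[0]
    # Non-empty frontier nodes on each side of the foot decide the type.
    has_left = any(leaf != "" for leaf in leaves[:foot])
    has_right = any(leaf != "" for leaf in leaves[foot + 1:])
    if has_left and has_right:
        return "wrapping"  # allowed in LTAG, excluded from LTIG
    if has_left:
        return "left"
    if has_right:
        return "right"
    return "empty"
```

Under this encoding, `("X", ["the", "X*"])` is a left auxiliary tree, while `("X", ["a", "X*", "b"])` is a wrapping tree that an LTIG would reject.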

Furthermore, LTIGs can be parameterized to form probabilistic models (Schabes and Waters, 1993b). Appendix A describes the parameters in detail. Informally speaking, a parameter is associated with each possible adjunction or substitution operation between a tree and a node. For instance, suppose there are V left auxiliary trees that might adjoin into node $\eta$. Then there are V + 1 parameters associated with node $\eta$ that describe the distribution of the likelihood of any left auxiliary tree adjoining into node $\eta$. (We need one extra parameter for the case of no left adjunction.) A similar set of parameters is constructed for the right adjunction and substitution distributions.

1 The best theoretical upper bound on time complexity for the recognition of Tree Adjoining Languages is $O(M(n^2))$, where $M(k)$ is the time needed to multiply two $k \times k$ boolean matrices (Rajasekaran and Yooseph, 1995).

Figure 1: A set of elementary LTIG trees that represent a bigram grammar. The arrows indicate adjunction sites. (Tree diagrams not recoverable from the source.)

Example sentence: The cat chases the mouse
Corresponding derivation tree: (diagram not recoverable from the source)

3 Experiments 

In the following experiments we show that 
PLTIGs of varying sizes and configurations can 
be induced by processing a large training cor- 
pus, and that the trained PLTIGs can provide 
parses on unseen test data of comparable qual- 
ity to the parses produced by PCFGs. More- 
over, we show that PLTIGs have significantly 
lower entropy values than PCFGs, suggesting 
that they make better language models. We 
describe the induction process of the PLTIGs in Section 3.1. Two corpora of very different nature are used for training and testing. The first set of experiments uses the Air Travel Information System (ATIS) corpus. Section 3.2 presents the complete results of this set of experiments. To determine if PLTIGs can scale up well, we have also begun another study that uses a larger and more complex corpus, the Wall Street Journal TreeBank corpus. The initial results are discussed in Section 3.3. To reduce the effect of the data sparsity problem, we back off from lexical words to using the part-of-speech tags as the anchoring lexical items in all the experiments. Moreover, we use the deleted-interpolation smoothing technique for the N-gram models and PLTIGs. PCFGs do not require smoothing in these experiments.

3.1 Grammar Induction 

The technique used to induce a grammar is a subtractive process. Starting from a universal grammar (i.e., one that can generate any string made up of the alphabet set), the parameters are iteratively refined until the grammar generates, hopefully, all and only the sentences in the target language, for which the training data provides an adequate sampling. In the case of a PCFG, the initial grammar production rule set contains all possible rules in Chomsky Normal Form constructed by the nonterminal and terminal symbols. The initial parameters associated with each rule are randomly generated subject to an admissibility constraint. As long as all the rules have a non-zero probability, any string has a non-zero chance of being generated. To train the grammar, we follow the Inside-Outside re-estimation algorithm described by Lari and Young (1990). The Inside-Outside re-estimation algorithm can also be extended to train PLTIGs. The equations calculating the inside and outside probabilities for PLTIGs can be found in Appendix B.

Figure 2: An example sentence. Because each tree is right adjoined to the tree anchored with the neighboring word in the sentence, the only structure is right branching.

As with PCFGs, the initial grammar must be 
able to generate any string. A simple PLTIG 
that fits the requirement is one that simulates 
a bigram model. It is represented by a tree set 
that contains a right auxiliary tree for each lexical item as depicted in Figure 1. Each tree has one adjunction site into which other right auxiliary trees can adjoin. The tree set has only one initial tree, which is anchored by an empty lexical item. The initial tree represents the start of the sentence. Any string can be constructed by right adjoining the words together in order. Training the parameters of this grammar yields the same result as a bigram model: the parameters reflect close correlations between words that are frequently seen together, but the model cannot provide any high-level linguistic structure. (See the example in Figure 2.)
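To make the correspondence with a bigram model concrete, the sketch below scores a sentence as a chain of right adjunctions: the tree anchored by each word adjoins into the single adjunction site of the tree anchored by the preceding word, and the last tree takes no adjunction. This is an illustration under assumptions: the `START` and `STOP` markers, the dictionary layout of the parameters, and the function name are hypothetical and not taken from the paper.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

START = "<s>"               # stands in for the empty anchor of the single initial tree
STOP: Optional[str] = None  # hypothetical marker for "no right adjunction at this site"

# adjoin_prob[(previous_anchor, next_anchor)] plays the role of the right-adjunction
# parameter at the previous tree's site; adjoin_prob[(anchor, STOP)] is the
# no-adjunction parameter for that site.
AdjoinTable = Dict[Tuple[str, Optional[str]], float]


def sentence_log2_prob(words: Sequence[str], adjoin_prob: AdjoinTable) -> float:
    """Log2 probability of a sentence under the bigram-style PLTIG."""
    previous = START
    total = 0.0
    for word in words:
        # The tree anchored by `word` right adjoins into the tree anchored by `previous`.
        total += math.log2(adjoin_prob[(previous, word)])
        previous = word
    # The final tree receives no further adjunction.
    total += math.log2(adjoin_prob[(previous, STOP)])
    return total
```

The derivation is a single chain, which is why the only structure this grammar can assign is right branching.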

To generate non-linear structures, we need to allow adjunction in both left and right directions. The expanded LTIG tree set includes a left auxiliary tree representation as well as a right one for each lexical item. Moreover, we must modify the topology of the auxiliary trees so that adjunction in both directions can occur. We insert an intermediary node between the root and the lexical word. At this internal node, at most one adjunction of each direction may take place. The introduction of this node is necessary because the definition of the formalism disallows right adjunction into the root node of a left auxiliary tree and vice versa. For the sake of uniformity, we shall disallow adjunction into the root nodes of the auxiliary trees from now on. Figure 3 shows an LTIG that allows at most one left and one right adjunction for each elementary tree. This enhanced LTIG can produce hierarchical structures that the bigram model could not (see Figure 4).

Figure 3: An LTIG elementary tree set that allows both left and right adjunctions.

Figure 4: With both left and right adjunctions possible, the sentences can be parsed in a more linguistically plausible way. (Example sentence: The cat chases the mouse. The derivation tree diagram is not recoverable from the source.)

It is, however, still too limiting to allow only one adjunction from each direction. Many words often require more than one modifier. For example, a transitive verb such as "give" takes at least two adjunctions: a direct object noun phrase, an indirect object noun phrase, and possibly other adverbial modifiers. To create more adjunction sites for each word, we introduce yet more intermediary nodes between the root and the lexical word. Our empirical studies show that each lexicalized auxiliary tree requires at least 3 adjunction sites to parse all the sentences in the corpora. Figure 5(a) and (b) show two examples of auxiliary trees with 3 adjunction sites.

Figure 5: Prototypical auxiliary trees for three PLTIGs: (a) L1R2, (b) L2R1, and (c) L2R2.

The number of parameters in a PLTIG is dependent on the number of adjunction sites just as the size of a PCFG is dependent on the number of nonterminals. For a language with V vocabulary items, the number of parameters for the type of PLTIGs used in this paper is $2(V+1) + 2V(K)(V+1)$, where K is the number of adjunction sites per tree. The first term of the equation is the number of parameters contributed by the initial tree, which always has two adjunction sites in our experiments. The second term is the contribution from the auxiliary trees. There are 2V auxiliary trees; each tree has K adjunction sites; and V + 1 parameters describe the distribution of adjunction at each site. The number of parameters of a PCFG with M nonterminals is $M^3 + MV$. For the experiments, we try to choose values of K and M for the PLTIGs and PCFGs such that

$$2(V + 1) + 2V(K)(V + 1) \approx M^3 + MV$$
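The two counting formulas are easy to evaluate side by side when choosing K and M. The helper below is a small sketch that transcribes them directly; the function and argument names are introduced here for illustration only.

```python
def pltig_param_count(vocab_size: int, sites_per_tree: int, initial_sites: int = 2) -> int:
    """Number of parameters 2(V+1) + 2V(K)(V+1) for the PLTIGs described above."""
    per_site = vocab_size + 1                  # V auxiliary trees plus "no adjunction"
    initial = initial_sites * per_site         # the single initial tree
    auxiliary = 2 * vocab_size * sites_per_tree * per_site  # 2V auxiliary trees, K sites each
    return initial + auxiliary


def pcfg_param_count(num_nonterminals: int, vocab_size: int) -> int:
    """Number of parameters M^3 + MV for a PCFG in Chomsky Normal Form."""
    return num_nonterminals ** 3 + num_nonterminals * vocab_size
```

With V = 32 part-of-speech tags, as in the ATIS experiments, `pltig_param_count(32, 3)` gives 6402 and `pcfg_param_count(15, 32)` gives 3855, which agree with the counts listed in Table 1. The same formulas give 8514 for K = 4 and 8640 for M = 20.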

3.2 ATIS 

To reproduce the results of PCFGs reported by Pereira and Schabes, we use the ATIS corpus for our first experiment. This corpus contains 577 sentences with 32 part-of-speech tags. To ensure statistical significance, we generate ten random train-test splits on the corpus. Each set randomly partitions the corpus into three sections according to the following distribution: 80% training, 10% held-out, and 10% testing. This gives us, on average, 406 training sentences, 83 testing sentences, and 88 sentences for held-out testing. The results reported here are the averages of ten runs.
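One way to produce such splits is to assign each sentence to a section independently, which makes the section sizes vary from split to split. The sketch below follows that reading; it is an assumption about the procedure rather than a description of the original setup, and the function name, seeds, and default proportions are illustrative.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_corpus(
    sentences: Sequence[T],
    seed: int,
    p_train: float = 0.8,
    p_heldout: float = 0.1,
) -> Tuple[List[T], List[T], List[T]]:
    """Randomly assign each sentence to the training, held-out, or test section."""
    rng = random.Random(seed)  # a private generator keeps each split reproducible
    train: List[T] = []
    heldout: List[T] = []
    test: List[T] = []
    for sentence in sentences:
        draw = rng.random()
        if draw < p_train:
            train.append(sentence)
        elif draw < p_train + p_heldout:
            heldout.append(sentence)
        else:
            test.append(sentence)
    return train, heldout, test


# Ten independent splits of a 577-sentence corpus (sentence ids stand in for sentences).
splits = [split_corpus(range(577), seed) for seed in range(10)]
```

Averaging a metric over the ten splits then amounts to evaluating it once per tuple in `splits` and taking the mean.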

We have trained three types of PLTIGs, varying the number of left and right adjunction sites. The L2R1 version has two left adjunction sites and one right adjunction site; L1R2 has one left adjunction site and two right adjunction sites; L2R2 has two of each. The prototypical auxiliary trees for these three grammars are shown in Figure 5. At the end of every training iteration, the updated grammars are used to parse sentences in the held-out test sets D, and the new language modeling scores (by measuring the cross-entropy estimates H(D, L2R1), H(D, L1R2), and H(D, L2R2)) are calculated. The rate of improvement of the language modeling scores determines convergence. The PLTIGs are compared with two PCFGs: one with 15 nonterminals, as Pereira and Schabes have done, and one with 20 nonterminals, which has a comparable number of parameters to L2R2, the larger PLTIG.
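The language modeling score and the convergence check can be sketched as follows. The per-word normalization matches the "bits per word" unit used later in the paper, but the exact estimator, the relative-improvement test, and the tolerance value are assumptions made for this example.

```python
from typing import Sequence


def cross_entropy_bits_per_word(
    sentence_log2_probs: Sequence[float],
    sentence_lengths: Sequence[int],
) -> float:
    """Estimate H(D, G) as minus the total log2 probability divided by the word count."""
    total_words = sum(sentence_lengths)
    if total_words == 0:
        raise ValueError("the data set contains no words")
    return -sum(sentence_log2_probs) / total_words


def has_converged(history: Sequence[float], tolerance: float = 0.001) -> bool:
    """True when the relative improvement over the last iteration falls below tolerance.

    `history` holds the held-out cross-entropy after each training iteration.
    The tolerance is a hypothetical value, not one reported in the paper.
    """
    if len(history) < 2:
        return False
    previous, current = history[-2], history[-1]
    return (previous - current) / previous < tolerance
```

Training would stop at the first iteration for which `has_converged` returns true on the held-out scores.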

In Figure 6, we plot the average iterative improvements of the training process for each grammar. All training processes of the PLTIGs converge much faster (both in numbers of iterations and in real time) than those of the PCFGs, even when the PCFG has fewer parameters to estimate, as shown in Table 1. From Figure 6, we see that both PCFGs take many more iterations to converge and that the cross-entropy value they converge on is much higher than that of the PLTIGs.

Figure 6: Average convergence rates of the training process for 3 PLTIGs and 2 PCFGs.

Table 2: Summary of pair-wise t-test for all grammars. If "better" appears at cell (i, j), then the model in row i has an entropy value lower than that of the model in column j in a statistically significant way. The symbol "—" denotes that the difference of scores between the models bears no statistical significance. (Table body not recoverable from the source.)

During the testing phase, the trained grammars are used to produce bracketed constituents on unmarked sentences from the testing sets T. We use the crossing bracket metric to evaluate the parsing quality of each grammar. We also measure the cross-entropy estimates H(T, L2R1), H(T, L1R2), H(T, L2R2), H(T, PCFG15), and H(T, PCFG20) to determine the quality of the language model. For a baseline comparison, we consider bigram and trigram models with simple right branching bracketing heuristics. Our findings are summarized in Table 1.
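The bracketing evaluation and the right branching baseline can be sketched in a few lines. The code below takes one common reading of the crossing bracket metric, namely the percentage of proposed constituents that do not cross any constituent of the reference bracketing. The span encoding, the function names, and the omission of the "period attaches high" refinement are assumptions of this example.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (i, j) covers words i+1 .. j, using positions between words


def crosses(a: Span, b: Span) -> bool:
    """True if the two spans overlap without one containing the other."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def non_crossing_rate(predicted: Sequence[Span], gold: Sequence[Span]) -> float:
    """Percentage of predicted brackets that are compatible with the gold brackets."""
    if not predicted:
        return 100.0
    compatible = sum(1 for p in predicted if not any(crosses(p, g) for g in gold))
    return 100.0 * compatible / len(predicted)


def right_branching(num_words: int) -> List[Span]:
    """Baseline bracketing: every word groups with everything to its right."""
    return [(i, num_words) for i in range(num_words - 1)]
```

For "The cat chases the mouse" with reference brackets `[(0, 2), (2, 5), (3, 5), (0, 5)]`, the right branching baseline proposes `(1, 5)`, which crosses the subject noun phrase `(0, 2)`, so three of its four brackets are compatible.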

The three types of PLTIGs generate roughly the same number of bracketed constituent errors as the trained PCFGs, but they achieve a much lower entropy score. While the average entropy value of the trigram model is the lowest, the difference between it and any of the three PLTIGs is not statistically significant. The relative statistical significance between the various types of models is presented in Table 2. In any case, the slight language modeling advantage of the trigram model is offset by its inability to handle parsing.

Our ATIS results agree with the findings of Pereira and Schabes, who concluded that the performance of the PCFGs does not seem to depend heavily on the number of parameters once a certain threshold is crossed. Even though PCFG20 has about as many parameters as the larger PLTIG (L2R2), its language modeling score is still significantly worse than that of any of the PLTIGs.

                              Bigram / Trigram   PCFG15   PCFG20   L1R2     L2R1     L2R2
Number of parameters          1088 / 34880       3855     [lost]   6402     [lost]   [lost]
Iterations to convergence     -                  45       45       19       17       24
Real-time convergence (min)   -                  62       142      8        7        14
H(T, Grammar)                 2.88 / 2.71        3.81     3.42     2.87     2.85     2.78
Crossing bracket (on T)       66.78              93.46    93.41    93.07    93.28    94.51

Table 1: Summary results for ATIS. The machine used to measure real-time is an HP 9000/859.

3.3 WSJ 

Because the sentences in ATIS are short with 
simple and similar structures, the difference in 
performance between the formalisms may not 
be as apparent. For the second experiment, 
we use the Wall Street Journal (WSJ) corpus, 
whose sentences are longer and have more var- 
ied and complex structures. We use sections 
02 to 09 of the WSJ corpus for training, sec- 
tion 00 for held-out data D, and section 23 for 
test T. We consider sentences of length 40 or 
less. There are 13242 training sentences, 1780 
sentences for the held-out data, and 2245 sen- 
tences in the test. The vocabulary set con- 
sists of the 48 part-of-speech tags. We compare 
three variants of PCFGs (15 nonterminals, 20 
nonterminals, and 23 nonterminals) with three 
variants of PLTIGs (L1R2, L2R1, L2R2). A 
PCFG with 23 nonterminals is included because 
its size approximates that of the two smaller 
PLTIGs. We did not generate random train-test splits for the WSJ corpus because it is large enough to provide adequate sampling. Table 3 presents our findings. From Table 3, we see several similarities to the results from the ATIS corpus. All three variants of the PLTIG formalism have converged at a faster rate and have far better language modeling scores than any of the PCFGs. Differing from the previous experiment, the PLTIGs produce slightly better crossing bracket rates than the PCFGs on the more complex WSJ corpus. At least 20 nonterminals are needed for a PCFG to perform in league with the PLTIGs. Although the PCFGs have fewer parameters, the rate seems to be indifferent to the size of the grammars after a threshold has been reached. While upping the number of nonterminal symbols from 15 to 20 led to a 22.4% gain, the improvement from PCFG20 to PCFG23 is only 0.5%. Similarly for PLTIGs, L2R2 performs worse than L2R1 even though it has more parameters. The baseline comparison for this experiment results in more extreme outcomes. The right branching heuristic receives a crossing bracket rate of 49.44%, worse than even that of PCFG15. However, the N-gram models have better cross-entropy measurements than PCFGs and PLTIGs; bigram has a score of 3.39 bits per word, and trigram has a score of 3.20 bits per word. Because the lexical relationship modeled by the PLTIGs presented in this paper is limited to those between two words, their scores are close to that of the bigram model.

4 Conclusion and Future Work 

In this paper, we have presented the results of two empirical experiments using Probabilistic Lexicalized Tree Insertion Grammars. Comparing PLTIGs with PCFGs and N-grams, our studies show that a lexicalized tree representation drastically improves the quality of language modeling of a context-free grammar to the level of N-grams without degrading the parsing accuracy. In the future, we hope to continue to improve on the quality of parsing and language modeling by making more use of the lexical information. For example, currently, the initial untrained PLTIGs consist of elementary trees that have uniform configurations (i.e., every auxiliary tree has the same number of adjunction sites) to mirror the CNF representation of PCFGs. We hypothesize that a grammar consisting of a set of elementary trees whose number of adjunction sites depends on their lexical anchors would make a closer approximation to the "true" grammar. We also hope to apply PLTIGs to natural language tasks that may benefit from a good language model, such as speech recognition, machine translation, message understanding, and keyword and topic

A Parameters of PLTIG

Each elementary tree of a Probabilistic Tree Insertion Grammar, denoted as $\rho$, has the following parameters:

$P_I(\rho)$: the probability that tree $\rho$ is the start of a derivation (i.e., tree $\rho$ does not adjoin or substitute into other trees). If $\rho$ is an auxiliary tree, $P_I(\rho) = 0$. The grammars used in our experiments have exactly one empty initial tree with $P_I(\rho_\epsilon) = 1$.

The parameters for adjunction and substitution are associated with each node of an elementary tree, denoted as $\eta$.

$P_L(\eta, \rho_L)$: the probability of adjunction between left auxiliary tree $\rho_L$ and node $\eta$.

$P_{NL}(\eta)$: the probability that no tree left adjoins into node $\eta$, such that

$$P_{NL}(\eta) + \sum_{\rho_L} P_L(\eta, \rho_L) = 1$$

$P_R(\eta, \rho_R)$: the probability of adjunction between right auxiliary tree $\rho_R$ and node $\eta$.

$P_{NR}(\eta)$: the probability that no tree right adjoins into node $\eta$, such that

$$P_{NR}(\eta) + \sum_{\rho_R} P_R(\eta, \rho_R) = 1$$

$P_S(\eta, \rho_S)$: the probability that an initial tree $\rho_S$ can substitute into node $\eta$. The grammars we used for our experiments have no substitution nodes, so this parameter is not used.
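The two normalization constraints can be represented and checked per node, as in the sketch below. The `NodeParams` class, its field names, and the uniform initialization are assumptions made for illustration; the paper does not state how the PLTIG parameters are stored or initialized.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class NodeParams:
    """Adjunction parameters attached to one node (hypothetical container)."""

    p_no_left: float                                          # P_NL for this node
    p_no_right: float                                         # P_NR for this node
    p_left: Dict[str, float] = field(default_factory=dict)    # P_L, keyed by tree name
    p_right: Dict[str, float] = field(default_factory=dict)   # P_R, keyed by tree name

    def is_normalized(self, tol: float = 1e-9) -> bool:
        """Check that each direction's V + 1 parameters sum to one."""
        left_total = self.p_no_left + sum(self.p_left.values())
        right_total = self.p_no_right + sum(self.p_right.values())
        return (math.isclose(left_total, 1.0, abs_tol=tol)
                and math.isclose(right_total, 1.0, abs_tol=tol))


def uniform_node(left_trees: Iterable[str], right_trees: Iterable[str]) -> NodeParams:
    """Give every outcome, including "no adjunction", the same probability."""
    left = list(left_trees)
    right = list(right_trees)
    p_l = 1.0 / (len(left) + 1)
    p_r = 1.0 / (len(right) + 1)
    return NodeParams(
        p_no_left=p_l,
        p_no_right=p_r,
        p_left={name: p_l for name in left},
        p_right={name: p_r for name in right},
    )
```

A node with V candidate left auxiliary trees therefore carries V + 1 left parameters, matching the count given in Section 2.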

B Inside-Outside Probabilities 

Let $O = O_1, O_2, \ldots, O_T$ be the observed sequence we wish to parse with a PLTIG. To estimate the likelihood of observing this sequence in the grammar and to re-estimate the parameters of the grammar so that they reflect the observations, we compute the inside and outside probabilities.

B.1 Inside Probabilities

The inside probability of a node $\eta$ between positions s and t is the probability that node $\eta$ can generate the partial observations between s and t (i.e., $O_{s+1}, \ldots, O_t$). This probability is denoted as $e(s,t,\eta)$. We calculate e recursively in a bottom-up manner. The base cases are when a node is an empty node or a foot node, which does not cover anything; and when a node covers a single lexical item (e.g., $O_{s+1}$).

$$e(s,s,\eta) = \begin{cases} 1 & \text{if } Foot(\eta) \text{ or } Label(\eta) = \epsilon \\ 0 & \text{otherwise} \end{cases}$$

$$e(s,s+1,\eta) = \begin{cases} 1 & \text{if } Label(\eta) = O_{s+1} \\ 0 & \text{otherwise} \end{cases}$$

We now show the general case of computing the inside probability for a node $\eta$ generating the sub-sequence of observations between positions s and t. Following the model outlined in Schabes and Waters (1993b), we enforce the restriction that a node cannot have more than one left or right adjunction. More specifically, there are four ways that node $\eta$ can generate the sub-sequence between positions s and t:

1. $e(s,t,\eta,\emptyset)$: the probability that $\eta$ covers $O_{s+1}, \ldots, O_t$ without any adjunction at $\eta$.

2. $e(s,t,\eta,L)$: exactly one left adjunction at $\eta$.

3. $e(s,t,\eta,R)$: exactly one right adjunction at $\eta$.

4. $e(s,t,\eta,LR)$: a simultaneous left and right adjunction at $\eta$.

The final value of $e(s,t,\eta)$ is the normalized sum of the four parts.

$$\begin{aligned}
e(s,t,\eta) = {} & P_{NL}(\eta)\,P_{NR}(\eta)\,e(s,t,\eta,\emptyset) \\
& + P_{NR}(\eta)\,e(s,t,\eta,L) \\
& + P_{NL}(\eta)\,e(s,t,\eta,R) \\
& + e(s,t,\eta,LR)/2
\end{aligned}$$
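The combination step is a direct weighted sum, transcribed below as a small sketch. It assumes the four partial inside values for one node and one span have already been computed; the function and argument names are introduced here for illustration.

```python
def combine_inside(
    p_no_left: float,
    p_no_right: float,
    e_none: float,
    e_left: float,
    e_right: float,
    e_both: float,
) -> float:
    """Combine the four partial inside probabilities of one node over one span.

    p_no_left and p_no_right are P_NL and P_NR for the node; e_none, e_left,
    e_right, and e_both are the no-adjunction, L, R, and LR values of e.
    """
    return (
        p_no_left * p_no_right * e_none  # neither direction adjoins
        + p_no_right * e_left            # left adjunction only
        + p_no_left * e_right            # right adjunction only
        + e_both / 2                     # both directions, halved as in the equation
    )
```

In a full implementation this function would be called once per node and span, after the four partial values for that span have been filled in bottom-up.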

For a node $\eta$ to cover the substring between positions s and t without any adjunction, it must be the case that its children jointly generate $O_{s+1}, \ldots, O_t$ (see footnote 2). If node $\eta$ has only one child $\eta_1$, then

$$e(s,t,\eta,\emptyset) = e(s,t,\eta_1)$$

If node $\eta$ has two children such that $\eta_1$ is the left child and $\eta_2$ is the right child, then

$$e(s,t,\eta,\emptyset) = \sum_{r=s}^{t} e(s,r,\eta_1)\,e(r,t,\eta_2)$$
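The two-child case is a sum over every split point between s and t. The sketch below assumes the inside values of the children are available as functions of a span; in a real implementation they would more likely be chart lookups, and the names used here are hypothetical.

```python
from typing import Callable

# An inside table for one node, viewed as a function from a span (s, t) to a probability.
Inside = Callable[[int, int], float]


def inside_two_children(e_left: Inside, e_right: Inside, s: int, t: int) -> float:
    """Inside probability without adjunction for a node with two children.

    Sums e(s, r, left child) * e(r, t, right child) over every split point r
    with s <= r <= t, so either child may cover an empty span.
    """
    return sum(e_left(s, r) * e_right(r, t) for r in range(s, t + 1))
```

For instance, if the left child can only cover a single word (with probability 1) and the right child covers a single word with probability 0.5, the only split of the span (0, 2) that contributes is r = 1, and the result is 0.5.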

Next, we consider the case when $\eta$ generates $O_{s+1}, \ldots, O_t$ by letting a left auxiliary tree adjoin into it. The auxiliary tree generates the front of the substring, $O_{s+1}, \ldots, O_r$, and $\eta$ generates the rest of the substring, $O_{r+1}, \ldots, O_t$, without adjunctions. The breaking position r can be anywhere between s and t.

$$e(s,t,\eta,L) = \sum_{\rho_L} \sum_{r=s+1}^{t} e(s,r,\rho_L) \times e(r,t,\eta,\emptyset) \times P_L(\eta,\rho_L)$$


Similarly, $e(s,t,\eta,R)$ represents the case when a right auxiliary tree, $\rho_R$, is adjoined into node $\eta$ to generate $O_{s+1}, \ldots, O_t$. Here $\eta$ generates the front of the substring without adjunctions, and the auxiliary tree generates the rest.

$$e(s,t,\eta,R) = \sum_{\rho_R} \sum_{r=s}^{t-1} e(s,r,\eta,\emptyset) \times e(r,t,\rho_R) \times P_R(\eta,\rho_R)$$

2 Or, if $\eta$ were a substitution node, then there must be a tree that generates the substring and can be substituted into $\eta$. We did not include the equations for this case because our grammars have no substitution nodes.



Finally, if both a left auxiliary tree and a right auxiliary tree adjoin into node $\eta$ simultaneously to generate $O_{s+1}, \ldots, O_t$, then we have to consider two breaking positions $r_1$ and $r_2$. The left auxiliary tree, $\rho_L$, generates $O_{s+1}, \ldots, O_{r_1}$; the node $\eta$ generates $O_{r_1+1}, \ldots, O_{r_2}$ (without adjunction); and the right auxiliary tree, $\rho_R$, generates the remaining observations $O_{r_2+1}, \ldots, O_t$.

$$e(s,t,\eta,LR) = \sum_{\rho_L} \sum_{\rho_R} \sum_{r_1=s+1}^{t-1} \sum_{r_2=r_1}^{t-1} e(s,r_1,\rho_L) \times e(r_1,r_2,\eta,\emptyset) \times e(r_2,t,\rho_R) \times P_R(\eta,\rho_R) \times P_L(\eta,\rho_L)$$

B.2 Outside Probabilities 



The outside probability of a node $\eta$ between positions s and t, denoted as $f(s,t,\eta)$, is the probability that the derived tree will generate $\eta$ and the two partial observations outside of s and t (i.e., the two sub-sequences $O_1, \ldots, O_s$ and $O_{t+1}, \ldots, O_T$). The outside probabilities complement the inside probabilities: the product of the matching inside and outside probabilities is the total probability of the observation sequence being generated by the grammar. Similar to the constructs of the inside probabilities, we define four types of outside probabilities:

1. $f(s,t,\eta,\emptyset)$: the probability that $\eta$ is generated without having any tree adjoining into it.

2. $f(s,t,\eta,L)$: $\eta$ is generated and a left auxiliary tree has adjoined into it. Moreover, the auxiliary tree does not cover any part of the substring between s and t.

3. $f(s,t,\eta,R)$: $\eta$ is generated and a right auxiliary tree has adjoined into it. Moreover, the auxiliary tree does not cover any part of the substring between s and t.

4. $f(s,t,\eta,LR)$: $\eta$ is generated and a left auxiliary tree and a right auxiliary tree have simultaneously adjoined into it. Neither auxiliary tree can cover any part of the substring between s and t.

Finally, $f(s,t,\eta)$ is the normalized sum of its four parts.

$$\begin{aligned}
f(s,t,\eta) = {} & P_{NL}(\eta)\,P_{NR}(\eta)\,f(s,t,\eta,\emptyset) \\
& + P_{NR}(\eta)\,f(s,t,\eta,L) \\
& + P_{NL}(\eta)\,f(s,t,\eta,R) \\
& + f(s,t,\eta,LR)/2
\end{aligned}$$

The outside probabilities are computed recursively in a top-down manner. The base case is the probability that the root node $\eta$ of an initial tree $\rho$ is generated. This is equal to the probability of the initial tree being the start of the derivation.

$$f(0,T,\eta,\emptyset) = \begin{cases} P_I(\rho) & \text{if } IsRoot(\eta,\rho) \\ 0 & \text{otherwise} \end{cases}$$

To compute the outside probability of a node without any auxiliary trees adjoining into it, we consider five (see footnote 3) different tree configurations.

• node $\eta$ is the only child of its parent node, $\eta_0$. Because no adjunction took place at $\eta$, its outside probability is the normalized outside probability of its parent node.

$$f(s,t,\eta,\emptyset) = f(s,t,\eta_0)$$

• node $\eta$ is the left child of node $\eta_0$ and $\eta$ has a sibling $\eta_2$, the right child of $\eta_0$. The outside probability of $\eta$ between positions s and t is the product of the outside probability of its parent node between positions s and r, where $t \le r \le T$, and the normalized inside probability of its sibling node $\eta_2$ deriving the substring between t and r.

$$f(s,t,\eta,\emptyset) = \sum_{r=t}^{T} f(s,r,\eta_0)\,e(t,r,\eta_2)$$

• node $\eta$ is the right child of node $\eta_0$ and $\eta$ has a sibling $\eta_1$, the left child of $\eta_0$.

$$f(s,t,\eta,\emptyset) = \sum_{r=0}^{s} f(r,t,\eta_0)\,e(r,s,\eta_1)$$

• node $\eta$ is the root node of a left auxiliary tree $\rho_L$ that left adjoins into a node $\eta_0$. Suppose that node $\eta_0$ derives the substring between positions t and r, where $t \le r \le T$. Then the outside probability of $\eta$ between s and t is the product of the outside probability of $\eta_0$ between s and r and the inside probability of $\eta_0$ deriving the observations between t and r without left adjunction. In order for $\rho_L$ to left adjoin into $\eta_0$, $\eta_0$ must not have previously left adjoined with any tree. Therefore, the inside probability of $\eta_0$ between t and r cannot include $e(t,r,\eta_0,L)$ or $e(t,r,\eta_0,LR)$.

$$f(s,t,\eta,\emptyset) = \sum_{r=t}^{T} P_L(\eta_0,\rho_L)\,f(s,r,\eta_0,\emptyset) \left[ e(t,r,\eta_0,\emptyset)\,P_{NR}(\eta_0) + e(t,r,\eta_0,R)/2 \right]$$

• node $\eta$ is the root node of a right auxiliary tree $\rho_R$ that right adjoins into a node $\eta_0$. Suppose node $\eta_0$ generates the partial string between positions r and s, where $0 \le r \le s$; then the outside probability of $\eta$ between s and t is the product of the outside probability of $\eta_0$ between r and t and the inside probability of $\eta_0$ between r and s without any right adjunction.

$$f(s,t,\eta,\emptyset) = \sum_{r=0}^{s} P_R(\eta_0,\rho_R)\,f(r,t,\eta_0,\emptyset) \left[ e(r,s,\eta_0,\emptyset)\,P_{NL}(\eta_0) + e(r,s,\eta_0,L)/2 \right]$$

3 For the work presented here, we do not consider the sixth case in which the node is the root of an initial tree that might substitute into a substitution node.

The remaining three types of outside probabilities are the cases in which auxiliary trees are adjoined into node $\eta$. First, we consider the case of left adjunction. Let tree $\rho_L$ be an auxiliary tree that is to be adjoined into $\eta$. $\rho_L$ must derive the partial observation immediately before position s (i.e., $O_{r+1}, \ldots, O_s$, where $0 \le r \le s$).

$$f(s,t,\eta,L) = \sum_{\rho_L} \sum_{r=0}^{s} e(r,s,\rho_L) \times P_L(\eta,\rho_L) \times f(r,t,\eta,\emptyset)$$

Similarly, if a right auxiliary tree, $\rho_R$, is to be adjoined into node $\eta$, then it must derive the partial observation immediately after position t (i.e., $O_{t+1}, \ldots, O_r$, where $t \le r \le T$).

$$f(s,t,\eta,R) = \sum_{\rho_R} \sum_{r=t}^{T} e(t,r,\rho_R) \times f(s,r,\eta,\emptyset) \times P_R(\eta,\rho_R)$$

Finally, in the case of simultaneous adjunction, both a left auxiliary tree $\rho_L$ and a right auxiliary tree $\rho_R$ are adjoined into node $\eta$ such that $\rho_L$ covers a partial string from position $r_1$ to s and $\rho_R$ covers a partial string from position t to $r_2$, where $0 \le r_1 \le s$ and $t \le r_2 \le T$.

$$f(s,t,\eta,LR) = \sum_{\rho_L} \sum_{\rho_R} \sum_{r_1=0}^{s} \sum_{r_2=t}^{T} e(r_1,s,\rho_L) \times e(t,r_2,\rho_R) \times f(r_1,r_2,\eta,\emptyset) \times P_R(\eta,\rho_R) \times P_L(\eta,\rho_L)$$


